Whether this is a great number or peculiar gadgets or glasses of cold vodka Martni, there are definitely many aspects constituting a great James Bond movie. Having watched A spy who loved me (1977) recently I’ve begun to wonder if that could be somehow assessed by the help of machine learning. This is my small summer data science project that aims to predict the average IMDB score for the newest 007 movie No time to die and see what contributes to the score.
More formally, we will be looking at a regression problem for a small dataset (24 items) and training a random forest using cross validation. Then we will examine the breakdown plot for a previously unseen example - the new movie. I will be using caret package for the model and DALEX for explanation.
data(JamesBond)
df <- JamesBond %>% select(-US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom) %>%
relocate(Avg_User_IMDB, .after = last_col())
I will be using the James Bond dataset from the HoRM package. See resources
Here we have the data of the previous 24 movies. I’m going to use 12 ouf
of the 18 available features, namely: Year, Movie, Bond, World_Adj,
Budget_Adj, Film_Length, Conquests, Martinis, BJB, Kills_Bond,
Kills_Others, Top_100, Avg_User_IMDB.
To clear out the feature names:
- World_Adj The film’s 2013-adjusted worldwide gross
(in 1000’s of U.S. dollars).
- Budget_Adj The film’s 2013-adjusted budget (in 1000’s
of U.S. dollars).
- Avg_User_IMDB The average user rating on IMDB
(www.imdb.com).
- Conquests The number of “conquests” by Bond in the
film.
- Martinis The number of martinis Bond drank in the
film.
- BJB The number of times Bond stated “Bond. James
Bond.” in the movie.
- Kills_Bond The number of people killed by Bond.
- Kills_Others The number of people killed in the film
by people other than Bond.
- Top_100 An indicator where a value of 1 means the
title song within the top 100 on the UK Singles Chart and the U.S.
Billboard Hot 100 and a value of 0 means it did not.
kable(df)
| Year | Movie | Bond | World_Adj | Budget_Adj | Film_Length | Conquests | Martinis | BJB | Kills_Bond | Kills_Others | Top_100 | Avg_User_IMDB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1962 | Dr. No | Sean Connery | 457928 | 7688 | 110 | 3 | 2 | 1 | 4 | 8 | 0 | 7.3 |
| 1963 | From Russia with Love | Sean Connery | 598624 | 15174 | 115 | 4 | 0 | 0 | 11 | 16 | 0 | 7.5 |
| 1964 | Goldfinger | Sean Connery | 935404 | 22468 | 110 | 2 | 1 | 2 | 9 | 68 | 1 | 7.8 |
| 1965 | Thunderball | Sean Connery | 1040693 | 66333 | 130 | 3 | 0 | 0 | 20 | 90 | 1 | 7.0 |
| 1967 | You Only Live Twice | Sean Connery | 775740 | 66035 | 117 | 3 | 1 | 0 | 21 | 175 | 1 | 6.9 |
| 1969 | On Her Majesty’s Secret Service | George Lazenby | 518736 | 50608 | 142 | 3 | 1 | 2 | 5 | 37 | 0 | 6.8 |
| 1971 | Diamonds Are Forever | Sean Connery | 664969 | 41274 | 120 | 1 | 0 | 1 | 7 | 42 | 1 | 6.7 |
| 1973 | Live and Let Die | Roger Moore | 846046 | 36603 | 121 | 3 | 0 | 1 | 8 | 5 | 1 | 6.8 |
| 1974 | The Man with the Golden Gun | Roger Moore | 459623 | 32965 | 125 | 2 | 0 | 2 | 1 | 5 | 0 | 6.7 |
| 1977 | The Spy Who Loved Me | Roger Moore | 710290 | 53636 | 125 | 3 | 1 | 1 | 31 | 116 | 1 | 7.1 |
| 1979 | Moonraker | Roger Moore | 672514 | 99134 | 126 | 3 | 1 | 1 | 12 | 69 | 0 | 6.2 |
| 1981 | For Your Eyes Only | Roger Moore | 498812 | 71514 | 127 | 2 | 0 | 2 | 18 | 36 | 1 | 6.8 |
| 1983 | Octopussy | Roger Moore | 437059 | 64102 | 131 | 2 | 0 | 1 | 15 | 43 | 1 | 6.5 |
| 1985 | A View to a Kill | Roger Moore | 329322 | 64730 | 131 | 4 | 0 | 2 | 5 | 57 | 1 | 6.2 |
| 1987 | The Living Daylights | Timothy Dalton | 390758 | 81749 | 130 | 2 | 2 | 1 | 13 | 29 | 0 | 6.7 |
| 1989 | License to Kill | Timothy Dalton | 292392 | 78637 | 133 | 2 | 1 | 1 | 10 | 13 | 0 | 6.5 |
| 1995 | Goldeneye | Pierce Brosnan | 542985 | 91404 | 130 | 2 | 1 | 1 | 47 | 25 | 0 | 7.2 |
| 1997 | Tomorrow Never Dies | Pierce Brosnan | 491098 | 159117 | 119 | 3 | 1 | 1 | 30 | 24 | 0 | 6.4 |
| 1999 | The World Is Not Enough | Pierce Brosnan | 504091 | 188130 | 128 | 3 | 1 | 2 | 27 | 43 | 0 | 6.3 |
| 2002 | Die Another Day | Pierce Brosnan | 557433 | 183255 | 133 | 2 | 2 | 1 | 31 | 20 | 1 | 6.0 |
| 2006 | Casino Royale | Daniel Craig | 686784 | 117465 | 144 | 2 | 3 | 1 | 11 | 11 | 1 | 7.9 |
| 2008 | Quantum of Solace | Daniel Craig | 638035 | 248014 | 106 | 1 | 6 | 0 | 16 | 15 | 1 | 6.7 |
| 2012 | Skyfall | Daniel Craig | 1120980 | 202240 | 143 | 3 | 1 | 1 | 26 | 26 | 1 | 7.8 |
| 2015 | Spectre | Daniel Craig | 864553 | 240803 | 148 | 3 | 1 | 1 | 30 | 205 | 1 | 6.8 |
In this section we will look at some interesting features. Hover over the bullet marker to get extra information about the movie.
plot_df <- JamesBond
plot_df$Movie <- factor (plot_df$Movie , levels = plot_df$Movie) # we want to preserve the dataframe's chronological movie ordering
m <- list(
l = 100,
r = 50,
b = 80,
t = 150,
pad = 2
)
fig <- plot_ly(plot_df, x = ~Year, y = ~Martinis, type= 'scatter', mode= 'lines+markers',
line = list(color = 'rgba(49,130,189, 1)', width = 4),
marker = list(color = 'rgba(49,130,189, 1)', size = 12),
hoverinfo = 'text',
text = ~paste(Movie, Year)) %>%
layout(title = "The number of Martini drinks Bond consumed in the movie",
font=list(size = 20),
margin = m,
xaxis = list(
showgrid = FALSE,
zeroline = FALSE,
tickfont = list(size = 16)
),
yaxis = list(
showgrid = FALSE,
zeroline = TRUE
))
fig
plot_df <- JamesBond
plot_df$Movie <- factor (plot_df$Movie , levels = plot_df$Movie) # we want to preserve the dataframe's chronological movie ordering
fig2 <- plot_ly(plot_df, x = ~Year, y = ~Kills_Bond, name = 'By Bond', type= 'scatter', mode= 'lines+markers',
line = list(color = 'rgba(102,0,0,1)', width = 4),
marker = list(color = 'rgba(102,0,0,1)', size = 12),
hoverinfo = 'text',
text = ~paste(Movie, Year)) %>%
add_trace(y = ~Kills_Others, type = 'scatter', mode= 'lines+markers',
name = "By Others",
line = list(color = 'middle gray', width = 4),
marker = list(color = 'middle gray', size = 12)) %>%
layout(title = "The number of kills",
font=list(size = 20),
margin = m,
xaxis = list(
showgrid = FALSE,
zeroline = FALSE,
tickfont = list(size = 16)
),
yaxis = list(
title.text = '',
showgrid = FALSE,
zeroline = TRUE
))
fig2
The most correlated feature with the average IMDB score is the World gross, which should be expected. It is interesting to note that Kills, Film Length and Budget are growing with Year.
fig_corr <- JamesBond %>% select(-US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom, -Movie, -Bond) %>%
relocate(Avg_User_IMDB, .after = last_col()) %>%
cor() %>%
ggcorrplot(method = 'square', type = 'upper', outline.color = 'white', lab_size = 8,
title= "Feature Correlations",
legend.title = "Corelation",
tl.cex = 18,
lab=TRUE) +
theme( text = element_text(size = 22))
fig_corr
It is time to train our regressor. We will use random forest with cross-validation. Next, we will create a DALEX explainer to examine the Feature Importance.
data(JamesBond)
df <- JamesBond %>% select(-Movie, -US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom) %>%
relocate(Avg_User_IMDB, .after = last_col())
train_control <- trainControl(method = "cv", number = 8, savePredictions = "all")
model_rf <- train(Avg_User_IMDB~. , data = df, method = "rf", ntree = 100, trControl = train_control)
explainer_rf <- DALEX::explain(model = model_rf,
label = "rf",
verbose = FALSE,
y = df$Avg_User_IMDB
)
print(model_performance(explainer_rf))
## Measures for: regression
## mse : 0.05157403
## rmse : 0.2270992
## r2 : 0.7990078
## mad : 0.1360167
##
## Residuals:
## 0% 10% 20% 30% 40% 50%
## -0.38593333 -0.30377500 -0.19552000 -0.12311000 -0.07096667 -0.03891667
## 60% 70% 80% 90% 100%
## -0.01014000 0.07145333 0.13466333 0.31706333 0.49035000
plot(model_parts(explainer_rf, loss_function = loss_root_mean_square)) + theme(text = element_text(size = 20))
I found the necessary statistics from No time to die (2021). As some were hard to find, I assumed the values of Martinis and BJBs.The gross and budget were adjusted to 2013 for consistency:
no_time_to_die <- data.frame(2021, "Daniel Craig", 608510000, 214930000, 163, 0, 1, 1, 66, 109, 1, 0)
colnames(no_time_to_die) <- colnames(df)
df <- rbind(df, no_time_to_die)
kable(no_time_to_die)
| Year | Bond | World_Adj | Budget_Adj | Film_Length | Conquests | Martinis | BJB | Kills_Bond | Kills_Others | Top_100 | Avg_User_IMDB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021 | Daniel Craig | 608510000 | 214930000 | 163 | 0 | 1 | 1 | 66 | 109 | 1 | 0 |
predict(explainer_rf, tail(df, n=1))
## 25
## 7.347983
pb = predict_parts(explainer = explainer_rf, tail(df, n=1), type= "break_down")
plot(pb)
We can examine the contribution of each feature of our example, the 25th 007 movie. In fact, we are very close to the true average IMDB rating! (https://www.imdb.com/title/tt2382320/, August 2022).
Please note that we used a very small dataset.
James Bond R dataset https://search.r-project.org/CRAN/refmans/HoRM/html/JamesBond.html
How to use DALEX with caret https://htmlpreview.github.io/?https://github.com/ModelOriented/DALEX-docs/blob/master/vignettes/DALEX_caret.html
R Markdown Cookbook https://bookdown.org/yihui/rmarkdown-cookbook/
EMA book https://ema.drwhy.ai/
Daniel Craig’s retrospective https://youtu.be/2oZdJrph3RA